Pushing the limits of solubility prediction via quality-oriented data selection
Authors
Abstract
• Consensus machine learning models perform better than singular models
• Quality-oriented data selection yields better results than using all data
• The uncertainty of test data determines the theoretical limit of a model's performance
• The concepts of actual and observed performances of solubility models are introduced

Accurate prediction of the solubility of chemical substances in solvents remains a challenge. The sparsity of high-quality solubility data is recognized as the biggest hurdle in the development of robust data-driven methods for practical use. Nonetheless, the effects of the quality and quantity of data on aqueous solubility predictions have not yet been scrutinized. In this study, the roles of the size and the quality of data sets on the performances of solubility prediction models are unraveled, and the concepts of actual and observed performances are introduced. In an effort to curtail the gap between actual and observed performances, a quality-oriented data selection method, which evaluates the quality of data and extracts the most accurate part of it through statistical validation, is designed. Applying this method to the largest publicly available solubility database and using a consensus machine learning approach, a top-performing solubility prediction model is achieved.

The solubility of chemical compounds in water is of fundamental interest, besides being a key property in the design, synthesis, performance, and functioning of new chemical motifs for various applications, including but not limited to drugs, paints, coatings, and batteries. Due to the time, cost, and feasibility constraints of experimental measurements (Murdande et al., 2011), it is usually not straightforward to obtain solubility data rapidly. Moreover, considering the vastness of chemical space, where the total number of small molecules (with up to 36 heavy atoms) is approximated to reach 10^33 (Polishchuk et al., 2013), it is necessary to find alternative routes for the accelerated screening of candidate molecules with intended solubility values. Data-driven modeling holds the promise of making solubility predictions in a tiny fraction of a second. A data-driven model consists of three main steps: collecting and processing the data to train the model, extracting and selecting the molecular descriptors, and training and testing the model.
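The highlight above, that the uncertainty of the test data limits the performance a model can be observed to reach, can be illustrated with a small simulation. This is a minimal sketch and not the authors' code: the function names, the uniform LogS range, and the noise levels are assumptions chosen for illustration. A "perfect" model (zero error against the true values) is scored against noisy test measurements, and its observed RMSE lands near the noise level of the test set instead of zero.

```python
import math
import random


def rmse(predicted, reference):
    """Root mean squared error between two equal-length sequences."""
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def simulate(test_sd, model_sd, n=20000, seed=0):
    """Return (actual_rmse, observed_rmse) for a simulated model.

    test_sd  : assumed measurement noise (SD, in LogS) of the test labels
    model_sd : assumed error (SD, in LogS) of the model against true values
    """
    rng = random.Random(seed)
    # Hypothetical "true" LogS values; the range is an illustrative assumption.
    true_logs = [rng.uniform(-8.0, 1.0) for _ in range(n)]
    measured = [t + rng.gauss(0.0, test_sd) for t in true_logs]
    predicted = [t + rng.gauss(0.0, model_sd) for t in true_logs]
    actual = rmse(predicted, true_logs)   # error against noise-free values
    observed = rmse(predicted, measured)  # error against noisy test labels
    return actual, observed


# A perfect model scored on a noisy test set.
perfect_actual, perfect_observed = simulate(test_sd=0.6, model_sd=0.0)
# An imperfect model: independent errors add in quadrature.
real_actual, real_observed = simulate(test_sd=0.6, model_sd=0.5)

print(f"perfect model: actual={perfect_actual:.3f}, observed={perfect_observed:.3f}")
print(f"real model:    actual={real_actual:.3f}, observed={real_observed:.3f}")
```

Under these assumptions (independent, zero-mean Gaussian errors), the observed RMSE approaches the square root of the sum of the squared model and test-set SDs, so lowering test-set noise narrows the gap between actual and observed performance.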
In recent years, there has been a burgeon of efforts that apply the above steps to develop solubility prediction models. Although these models cater to achieving predictions quickly, they have not been widely adopted by the community due to accuracy issues (Jouyban, 2009). The factors that affect the accuracy of a model can be basically grouped into four categories (Haghighatlari et al., 2020): the quality of data, the size of data, the relevance of descriptors, and the capability of the algorithm (Figure 1A). The first two pertain to the data and the latter two to the model. Depending on the physical domain of the problem, these factors may vary in their significance. In the case of solubility, the paucity of data, in addition to the internal errors that result from uncertainties in measurement procedures, is well-known. Thus, data is of priority interest when improving model performance (Tetko et al., 2001; Jorgensen and Duffy, 2002; Bergström et al., 2004; Balakin et al., 2006; Hewitt et al., 2009; Wang and Hou, 2011; Falcon-Cano et al., 2020).
As a generally accepted threshold in this context, it is stated that the accuracy of a model cannot exceed the accuracy of the data (Jorgensen and Duffy, 2002). This statement is [...] correct and can be further consolidated, since machine learning (ML) algorithms are capable of dealing with noise (Kordos and Rusiecki, 2016). To put it differently, [...] the error of the test set. To improve ML algorithms, it is therefore important to distinguish and comprehend the factors affecting them. Figure 1B shows the decomposition of model performance. We define the actual performance as the performance a model would have on a test set with zero error. In contrast, the observed performance is the one measured on a test set that carries error (demonstrated in Figure 1B). Obviously, one can only measure the observed performance. For instance, a perfect model, which by definition should predict the absolute true values, will show a non-zero observed error despite its actual error being zero. Therefore, in domains where accurate data is accessible, the error of the test data is small enough to be ignored. However, in the case of solubility, it is of decisive importance and should be carefully treated.

In the current work, to develop [...], we focus on the data [...] ML [...]. Starting with the design of a quality-oriented data selection method and applying it to AqSolDB (Sorkun et al., 2019), the largest publicly available solubility database curated from multiple sources, the Aqueous Solubility Prediction Model (AqSolPred) is developed. AqSolPred shows superior performance compared with [...] on a conventionally used benchmark data set (Huuskonen, 2000). [...] selection, AqSolPred comprises three different ML algorithms, namely Artificial Neural Network (ANN), Random Forest (RF), and Extreme Gradient Boosting (XGB). Below, we provide a detailed description of the process, alongside links to the open-source codes and data. In the following paragraphs, we briefly review the principal factors affecting solubility predictions.

It is a well-known fact that increasing the number of training instances has a positive effect on model accuracy. Lusci et al. trained UG-RNN models on four datasets of 1144, 1026, 74, and 125 instances, and obtained respective root mean squared errors (RMSEs) of 0.58, 0.60, 0.96, and 1.14 (Lusci et al., 2013). It should be noted that the sizes of the training and test sets yield different impacts. While the size of the training set affects the accuracy of a model, the size of the test set affects the reliable evaluation of that accuracy. A proper test set should be both large enough to cover [...] and minimally affected by outliers. [...] values [...] distribution similar [...]. For example, the test set of (Yalkowsky and Banerjee, 1992), commonly used in the literature (Delaney, 2004;
Dearden, 2006), has only 21 compounds for testing. Since earlier studies had very few data available, [...] studies [...] thousands [...] hundreds (Balakin et al., 2006). With the increase in public data resources, such as AqSolDB consisting of ~10^4 compounds, it is becoming more feasible to conduct [...] accuracies.

Performing accurate solubility measurements is a difficult task, as explained in detail in (Avdeef, 2020). Additionally, unintentional misprints and erroneous conversions of values or units while carrying them from one source to another cause deterioration of data quality. Unfortunately, information on the error of individual data is [...] complete. [...] SD of 0.5 to 0.6 LogS [...]. Recently, Avdeef determined the average SD of 870 molecules in the Wiki-pS0 database as 0.17 LogS (Avdeef, 2019), which is quite distant from the values conceded in the literature. One should keep in mind that the SDs of specific data sets may differ significantly depending on the types of molecules they contain. Lowly soluble compounds are extremely difficult to measure (Hewitt et al., 2009), and thus their error is high. Accordingly, one expects data sets that contain many lowly soluble compounds to have high SDs. It is therefore essential to determine the quality of data prior to [...].

Similar to size, the quality of the training and test data has distinct effects on the assessment of a model. Test set quality regulates [...] (Figure 1). To correctly evaluate a model, it is vital to use [...]. In the second solubility challenge (Llinas et al., 2020), two test sets of different qualities, high (SD: 0.17 LogS) and low (SD: 0.62 LogS), were shared, and participants were invited to test their own methods. From 37 methods, the RMSEs for the high- and low-quality sets were 1.14 and 1.62 LogS, respectively. All methods performed worse on the low-quality set. This [...] models, [...] performances. [...] partly compensated [...] diversity [...] models' [...] smaller [...] themselves.

Descriptors are mathematical representations of the information contained in a chemical compound. They are valuable inputs for models aimed at predicting properties. Descriptors are classified into groups, including 2D and 3D. Basically, descriptors that require 3D optimization of the molecular structure are considered 3D descriptors, and the remaining ones 2D descriptors. There are several resources to calculate descriptors (Yap, 2011; Moriwaki et al., 2018). Most calculated descriptors carry errors due to methodological approximations (Raevsky et al., 2019).
Admitting that 3D descriptors carry additional information, such as atomic distances and energy, there is no clear evidence about their impacts on prediction accuracy (Gao et al., 2020; Yan et al., 2004; Salahinejad et al., 2013). It is preferred to use a modest number of relevant descriptors to avoid redundancy and overfitting during training (Wang and Hou, 2011).

Earlier studies used simple linear regression (LR) models (Delaney, 2004; Hansch et al., 1968; Yalkowsky and Valvani, 1980; Meylan et al., 1996) based on descriptors such as lipophilicity (LogP), melting point, and molecular weight. Although these models are easy to interpret, their predictive power is rather low, since LR works only for linear dependencies. In the last [...], variations of ANNs and tree-based ensembles, which have proved their ability in solving complex problems in various research fields, have also been applied to solubility prediction (Huuskonen, 2000; Yan and Gasteiger, 2003; Schroeter et al., 2007; Tang et al., 2020). Due to their black-box nature, these models are hard for humans to interpret. [...] expert knowledge [...] circumvent these issues. As long as they are properly configured and fed with a sufficient amount of data, they become competent [...]. Compared with singular models, a consensus model combines multiple models (Todeschini et al., 2020)
with the aim of compensating for the weaknesses of each, which results in improved accuracy (Bergström et al., 2004; Abshear et al., 2006; Chevillard et al., 2012; Raevsky et al., 2015). [...] variances [...] constituting [...] uncertainties.

The phases of [...] are shown in Figure 2. For [...] purposes, [...] AqSolDB, which merges nine sub-datasets, named A to I (Table 1). Detailed information on the sub-datasets is provided in (Sorkun et al., 2019). The data is accessible at (https://doi.org/10.7910/DVN/OVHAW8) and the code for curation at (https://doi.org/10.24433/CO.1992938.v1).

Table 1. AqSolDB and its sub-datasets
Data set   Size   Filtered size   N(SD)   SD
A          6110   3266            3093    0.717
B          4651   3185            1215    0.372
C          2603   1798            668     0.380
D          2115   1054            179     0.361
E          1291   1290            337     0.274
F          1210   1011            202     0.582
G          1144   363             170     0.392
H          578    148             100     0.383
I          94     62              46      0.338
All        9982   6937            –       0.495
Non-AF     6154   4399            –       0.356
Size, size before pre-processing; Filtered size, size after pre-processing; N(SD), number of compounds used to calculate the SD; SD, standard deviation (LogS).

Size and quality, as discussed above, affect [...] differently. Instead of [...] directly, we applied a [...] procedure to each sub-dataset in terms of multi-lab SD, as described in Methods. The number of compounds used (N(SD)) and the SDs are given in Table 1. The SDs differ significantly, with numerical values between 0.274 and 0.717 LogS. Set E has the lowest SD and set A the highest. Adversely, [...] F [...] close [...] other [...] <0.4 [...] approach above. [...] possible quality. Set E is selected as the test dataset among the sets. Note that set E is also known as the Huuskonen set. Using the t-distributed stochastic neighbor embedding (t-SNE) dimensionality reduction technique (Maaten and Hinton, 2008), we validated that [...] largely covers [...] reduced [...] (Figure 3). [...] compatible (Figure S1). After reserving the test set, sets A and F were removed, and the training set, non-AF, was obtained. Non-AF is built by incorporating the remaining constituent sub-datasets. For comparison, the entire data set, All, [...] same [...]. As discussed, [...] positively correlated [...] decreases [...]. To analyze the trade-off between size and quality, we developed separate models [...] fair [...] combinations [...] best configurations [...] 10-fold cross-validation [...] final [...] ensuring [...] process (see Methods), and tested them against the test set (Figure 4).
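The sub-dataset quality assessment summarized in Table 1 can be sketched as follows. The Methods section is not part of this excerpt, so the scoring below is an assumption, not the authors' procedure: for every compound reported by at least two sources, the sample standard deviation of its LogS values is computed, and each sub-dataset is scored by the mean of those SDs over the multi-source compounds it contains. The records, the helper names, and the 0.2 LogS cutoff are hypothetical.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical records: (compound id, sub-dataset label, reported LogS).
records = [
    ("c1", "A", -2.0), ("c1", "B", -2.4),
    ("c2", "A", -4.0), ("c2", "B", -4.1), ("c2", "E", -4.3),
    ("c3", "E", -1.0), ("c3", "B", -1.1),
    ("c4", "A", -3.0),  # single source: cannot contribute to an SD
]


def score_sub_datasets(rows):
    """Return {sub-dataset: (n_compounds_used, mean multi-source SD)}."""
    by_compound = defaultdict(list)
    for compound, source, logs in rows:
        by_compound[compound].append((source, logs))

    sds_per_source = defaultdict(list)
    for entries in by_compound.values():
        if len(entries) < 2:
            continue  # an SD needs at least two reported values
        compound_sd = stdev(value for _, value in entries)
        for source, _ in entries:
            sds_per_source[source].append(compound_sd)

    return {src: (len(sds), mean(sds)) for src, sds in sds_per_source.items()}


def select_sub_datasets(scores, max_sd):
    """Keep the sub-datasets whose mean SD is below an assumed cutoff."""
    return sorted(src for src, (_, sd) in scores.items() if sd < max_sd)


scores = score_sub_datasets(records)
kept = select_sub_datasets(scores, max_sd=0.2)  # cutoff is illustrative only

for src in sorted(scores):
    n_used, sd = scores[src]
    print(f"{src}: N(SD)={n_used}, SD={sd:.3f}")
print("kept:", kept)
```

In this toy input the ordering mirrors the pattern in Table 1, with set E scoring lowest and set A highest, but the numbers themselves carry no meaning beyond the illustration.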
understand predictions, A-B D-F, found those having higher tha
Similar resources
Pushing the Limits of Translation Quality Estimation
Translation quality estimation is a task of growing importance in NLP, due to its potential to reduce post-editing human effort in disruptive ways. However, this potential is currently limited by the relatively low accuracy of existing systems. In this paper, we achieve remarkable improvements by exploiting synergies between the related tasks of word-level quality estimation and automatic post-...
Pushing the limits of crystallography
A very serious concern of scientists dealing with crystal structure refinement, including theoretical research, pertains to the characteristic bias in calculated versus measured diffraction intensities, observed particularly in the weak reflection regime. This bias is here attributed to corrective factors for phonons and, even more distinctly, phasons, and credible proof supporting this assumpt...
Pushing the limits of SPM
Breaking the speed limit. One of the serious, inherent limitations in all forms of SPM is the low imaging rate. This is because each image is built up pixel by pixel in a sequential scan of the surface. The scanning motion involves electronically controlled mechanical displacements, and usually some form of feedback is applied between a control parameter, such as tunneling current or force, and ...
Prediction of the pharmaceutical solubility in water and organic solvents via different soft computing models
Solubility data of solids in aqueous and different organic solvents are very important physicochemical properties considered in the design of industrial processes and in theoretical studies. In this study, experimental solubility data of 666 pharmaceutical compounds in water and 712 pharmaceutical compounds in organic solvents were collected from different sources. Three different artificia...
The prediction of lymphedema via the combination of the selected data mining algorithms
Background: Breast cancer is the second leading cause of cancer death in women, after lung cancer. Due to the importance of predicting this disease, the use of data mining methods in medical research is more significant than before. Data mining algorithms can be a great help in preventing the development of lymphedema in patients. The aim of this study was to create a diagnosis system that can ...
Journal
Journal title: iScience
Year: 2021
ISSN: 2589-0042
DOI: https://doi.org/10.1016/j.isci.2020.101961